The DARPA Image Understanding Motion Benchmark
Abstract
Benchmarks and test suites are an essential element of the architectural evaluation process. At the conclusion of the last DARPA workshop on vision benchmarks to test the performance of parallel architectures, it was recommended that the DARPA Image Understanding Benchmark [Weems, 1991] be extended with a second-level task that adds motion and tracking to the original task. We have now developed this new benchmark and a sample solution. This paper describes the benchmark and presents some timing results for various common workstations.

1. History of the DARPA Benchmark Effort

One of the first parallel processor benchmarks to address vision-related processing was the Abingdon Cross benchmark, defined at the 1982 Multicomputer Workshop in Abingdon, England [Preston, 1986]. In that benchmark, the specified input image consisted of a dark background with a pair of brighter rectangular bars, equal in size, that cross at their midpoints and are centered in the image, with Gaussian noise added to the entire image. The goal of the exercise was to determine and draw the medial axis of the cross formed by the two bars. The results obtained from solving the benchmark problem on various machines were presented at the 1984 Multicomputer Workshop in Tanque Verde, Arizona, and many of the participants spent a fairly lengthy session discussing problems with the benchmark and designing a new benchmark that, it was hoped, would solve those problems.

It was the perception of the Tanque Verde group that the major drawback of the Abingdon Cross was its lack of breadth: the problem required only a small repertoire of image processing operations to construct a solution. The second concern of the group was that the specification did not constrain the a priori information that could be used to solve the problem. In theory, a valid solution would have been to simply draw the medial lines, since their true positions were known. Although this was never done, there was argument over whether it was acceptable for a solution to make use of the fact that the bars were oriented horizontally and vertically in the image. A final concern was that no method was prescribed for solving the problem, with the result that every solution was based on a different method. When a benchmark can be solved in different ways, the performance measurements become more difficult to compare because they include an element of programmer cleverness. The use of a consistent method would also permit some comparison of the basic operations that make up a complete solution. See [Duff, 1986; Carpenter, 1987] for deeper discussions of these issues.

The Tanque Verde group specified a new benchmark, called the Tanque Verde Suite, that consisted of a large collection of individual vision-related problems. A list of twenty-five problems was developed, each of which was to be further defined by a member of the group, who would also generate test data for their assigned problem. Unfortunately, only a few of the problems were ever developed, and none of them were widely tested on different architectures. Thus, while the simplicity of the Abingdon Cross may have been criticized, it was the correspondingly greater complexity of the Tanque Verde Suite that inhibited its use.

In 1986, a new benchmark was developed at the request of the Defense Advanced Research Projects Agency (DARPA). Like the Tanque Verde Suite, it was a collection of vision-related problems, but the set of problems that made up the new benchmark was much smaller and easier to implement.
Just eleven problems comprised this benchmark. A workshop was held in Washington, D.C., in November of 1986 to present the results of testing the benchmark on several machines, with those results summarized in [Rosenfeld, 1987]. The consensus of the workshop participants was that the results could not be compared directly, for several reasons. First, as with the Abingdon Cross, no method was specified for solving any of the problems; thus, in many cases, the timings were more indicative of the knowledge or cleverness of the programmer than of a machine's true capabilities. Second, no input data was provided, and the specifications allowed a wide range of possible inputs. Some participants therefore chose to test a worst-case input, while others chose "average" input values that varied considerably in difficulty.

The workshop participants pointed out other shortcomings of the benchmark. Chief among these was that, because it consisted of isolated tasks, it did not measure performance related to the interactions between the components of a vision system. For example, there might be a particularly fast solution to a problem on a given architecture if the input data is arranged in a special manner. However, this apparent advantage might be inconsequential if a vision system does not normally use the data in such an arrangement and the cost of rearranging the data is high. Another shortcoming was that the problems had not been solved before they were distributed. Thus, there was no canonical solution on which the participants could rely for a definition of correctness, and there was even one problem for which it turned out there was no practical solution. The issue of having a ground truth, or known correct solution, was considered very important, since it is difficult to compare the performance of two architectures when they produce different results. For example, is an architecture that performs a task in half the time of another really twice as powerful if the first machine's programmer used integer arithmetic while the second machine was programmed to use floating point, and the two thus obtained significantly different results? Since problems in vision are often ill-defined, it is possible to argue for the correctness of many different solutions. In a benchmark, however, the goal is not to solve a vision problem but to test the performance of different machines doing comparable work.

One conclusion from the benchmark exercise was that a new benchmark should be developed that addresses the shortcomings of the preceding benchmarks. Specifically, the new benchmark should test system performance on a task that approximates an integrated solution to a machine vision problem. A complete solution with test data sets should be constructed and distributed with the benchmark specification. And every effort should be made to specify the benchmark in such a way as to minimize the opportunities for taking shortcuts in solving the problem. The task of constructing this Image Understanding Benchmark was assigned to the vision research groups at the University of Massachusetts at Amherst and the University of Maryland. Following the 1986 meeting, a preliminary benchmark specification was drawn up and circulated among the DARPA image understanding community for comment. A revised specification was then programmed on a standard sequential machine. A set of five test cases was developed, along with a sample parallel solution for a commercial multiprocessor. In March of 1988, the benchmark was released.
The benchmark consisted of the sequential and parallel solutions, the five test cases, and software for generating additional test data. The benchmark specification was presented at the DARPA Image Understanding Workshop, the International Supercomputing Conference, and the Computer Vision and Pattern Recognition conference [Weems, 1988]. Over 25 academic and industrial groups obtained copies of the benchmark release. Nine of those groups developed either complete or partial versions of the solution for an architecture. A workshop was held in October of 1988 to present those results to members of the DARPA research community. As with the previous workshops, the participants spent a session developing a critique and making recommendations for the design of the next benchmark [Weems, 1991]. The conclusions of the participants included various improvements to the engineering of the benchmark code, all of which have since been carried out. In addition, it was recommended that a second level of the benchmark be specified that extends the current problem to a sequence of images with motion analysis. The second level would be an optional exercise, built on top of the current problem, to demonstrate specific real-time capabilities of certain architectures. It is this second level of the benchmark that is presented here.

2. Benchmark Task Overview

The overall task to be performed by this benchmark is the recognition and tracking of an approximately specified 2 1/2 dimensional "mobile" sculpture that is moving in a cluttered environment, given a series of synthetic images from simulated intensity and range sensors. These scenes follow the same pattern as the static version of the DARPA IU Benchmark, but in the new benchmark the mobile and chaff are blown around the scene by an idealized wind to produce predictable motion. The motion involves movement of the entire mobile as a unit, and movement of its individual components. The motions are both translational and rotational, and they are controlled by reasonably realistic physical constraint models.

The new benchmark is meant to supplement, rather than replace, the earlier benchmark, which tests system performance at the kernel operation level within the framework of a larger task. We recommend that developers begin by implementing the earlier benchmark on their machines; the motion benchmark can then be constructed more easily by reusing the code modules from the first benchmark. The goal of the new benchmark is to extend the testing of system performance over a longer period of time so that, for example, caches and page tables will be filled and achieve steady-state behavior. The benchmark also explores the I/O and real-time capabilities of the systems under test, and involves more high-level processing. Thus, the combination of the two benchmarks allows developers to analyze the performance and behavior of systems both at a fine level of granularity on a single burst of processing, and at a coarser granularity under a sustained load.

Unlike the previous benchmark, there are no fixed data sets. Given the number of frames that must be processed in a single test, it is too unwieldy to prepare the input data for distribution. Instead, we have developed a data set generator that can be used to repeatably produce the same image sequence from a set of input parameters, as sketched below.
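The following is a minimal sketch, not part of the benchmark release, of how such repeatable generation can be structured: all state flows from the parameter set, and a self-contained pseudo-random generator is used so that the same seed produces the same frame sequence on any machine. The names (SceneParams, next_rand, generate_frame) are hypothetical.

    /* Deterministic frame-sequence generation from a parameter set.
     * A private linear congruential generator avoids the
     * platform-dependent behavior of rand(). */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t seed;      /* reproducibility key                  */
        int      n_frames;  /* length of the image sequence         */
        double   wind_dx;   /* constant wind, pixels per frame in x */
        double   wind_dy;   /* constant wind, pixels per frame in y */
    } SceneParams;

    static uint32_t next_rand(uint32_t *state)
    {
        *state = *state * 1664525u + 1013904223u;  /* 32-bit LCG step */
        return *state;
    }

    static void generate_frame(const SceneParams *p, uint32_t *state, int frame)
    {
        /* Placeholder: a real generator would displace every mobile and
         * chaff rectangle according to the wind model, then render the
         * 512 x 512 intensity and range images for this frame. */
        uint32_t noise = next_rand(state);
        printf("frame %d/%d: noise word %08x, wind (%.1f, %.1f)\n",
               frame, p->n_frames, (unsigned)noise, p->wind_dx, p->wind_dy);
    }

    int main(void)
    {
        SceneParams p = { 12345u, 3, 1.5, -0.5 };
        uint32_t state = p.seed;   /* same parameters => same sequence */
        for (int f = 0; f < p.n_frames; f++)
            generate_frame(&p, &state, f);
        return 0;
    }

Because every value derives from the parameter set, a sequence can be regenerated on demand rather than stored and shipped.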
Users thus have the flexibility to generate data sets for digital storage, for output to video media, or in real time on one machine for direct input to the machine under test. A standard set of parameters for the model generator is supplied to serve as the canonical benchmark data set for system performance comparisons.

2.1 The Image Environment

The sculpture is a collection of two-dimensional rectangles of various sizes, brightnesses, two-dimensional orientations, and depths. Each rectangle is oriented normal to the Z axis (the viewing axis), with constant depth across its surface, and the images are constructed under orthographic projection. Thus an individual rectangle has no intrinsic depth component, but depth is a factor in the spatial relationships between rectangles; hence the notion that the sculpture is 2 1/2 dimensional.

Conceptually, the elements of the sculpture are linked by an invisible set of horizontal rods and vertical suspension wires. A model is constructed as a tree structure in which the links of the tree represent the invisible links in the sculpture. Each node of the tree contains depth, size, orientation, and intensity information for a single rectangle, and the child links of a node describe the spatial relationships between that node and certain other nodes below it (a minimal sketch of such a node appears at the end of this section). The model is further endowed with physical properties, such as twist-induced torque on the wires, that lend a degree of reality to the motions of the model elements and constrain their relationships and trajectories. However, rectangles may intersect the invisible links, as if the links were not physically present.

The clutter in the scene consists of additional rectangles, with sizes, brightnesses, two-dimensional orientations, and depths similar to those of the sculpture. Rectangles may partially or completely occlude other rectangles. It is also possible for a rectangle to effectively disappear when another rectangle of the same brightness, or of only slightly greater depth, is located directly behind it. The chaff rectangles are modeled as physically independent of each other, so that their motions do not influence one another.

The intensity and depth sensors are precisely registered with each other, and both have a resolution of 512 x 512 pixels. There is no averaging or aliasing in either sensor. A pixel in the intensity image is an 8-bit integer grey value; in the depth image, a pixel is a 32-bit floating-point range value. The intensity image is noise free, while the depth image has added Gaussian noise. The motion of both the model elements and the chaff is driven by a "wind" that blows with constant speed and direction through the scene. The scene elements are also constrained to two-dimensional motions; that is, they do not change their depth in the scene but merely move laterally at their assigned depths.
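As a concrete, though hypothetical, illustration of the model representation described above, a node might be declared as follows; the type and field names are assumptions, not the benchmark's actual declarations.

    /* One rectangle of the sculpture, plus its links to the nodes
     * suspended below it.  Rotating or translating a node implicitly
     * moves its whole subtree. */
    typedef struct ModelNode {
        double            depth;            /* constant Z across the face   */
        double            width, height;    /* rectangle size               */
        double            angle;            /* 2-D orientation, in radians  */
        unsigned char     intensity;        /* 8-bit grey value             */
        double            link_dx, link_dy; /* offset from the parent node  */
                                            /* along its rod and wire       */
        int               n_children;       /* number of subtrees, up to 4  */
        struct ModelNode *children[4];      /* nodes suspended below        */
    } ModelNode;

A depth-first traversal of such a tree visits every rectangle exactly once, which is convenient both for rendering the model and for matching it against features extracted from the images.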
2.2 Processing of the Images

As with the earlier benchmark, an initial set of models is provided that represents a collection of similar sculptures, and an initial search phase involves identifying the model that best matches the object present in the scene. None of the initial models is a perfect match for the sculpture in the scene, so it is necessary to identify the closest match. The scenario that the designers imagined in constructing the problem was a semi-rigid "mobile", viewed from above, with bits and pieces of other mobiles blowing through the scene.

The initial state of the system is that previous processing has narrowed the range of potential matches to a few similar sculptures and has oriented them to correspond with information extracted from a previous image. However, the objects in the scene have since moved, and a new set of images has been taken before the matching process could be completed. The system must make its final choice of a best match and update the corresponding model with the positional information extracted from the latest images. Once the model has been recognized, it is located in a second set of images, and the displacements of its component parts are used to predictively track their positions in subsequent frames.
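The per-component tracking step lends itself to a simple predictive scheme. The sketch below assumes a constant-velocity model, which is consistent with the constant wind of the benchmark; it is illustrative only, and the names (Track, predict, update) are hypothetical.

    /* Track state for one rectangle of the recognized model. */
    typedef struct {
        double x, y;    /* last measured position, in pixels */
        double vx, vy;  /* displacement per frame            */
    } Track;

    /* Predict where the component should appear in the next frame,
     * so the search for it can be confined to a small window. */
    void predict(const Track *t, double *px, double *py)
    {
        *px = t->x + t->vx;
        *py = t->y + t->vy;
    }

    /* Fold in a new measurement: the observed displacement becomes
     * the new per-frame velocity estimate. */
    void update(Track *t, double mx, double my)
    {
        t->vx = mx - t->x;
        t->vy = my - t->y;
        t->x  = mx;
        t->y  = my;
    }

Because the wind is constant, even a first-order predictor of this kind keeps the search windows small; an implementation could of course substitute a more elaborate filter.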